Abstract

Language modelling is a fundamental building block of natural language
processing. However, in practice the size of the vocabulary limits the
distributions applicable for this task: specifically, one has to either
resort to local optimization methods, such as those used in neural
language models, or work with heavily constrained distributions. In this
work, we take a step towards overcoming these difficulties. We present a
method for global-likelihood optimization of a Markov random field
language model exploiting long-range contexts in time independent of the
corpus size. We take a variational approach to optimizing the likelihood
and exploit underlying symmetries to greatly simplify learning. We
demonstrate the efficiency of this method both for language modelling and
for part-of-speech tagging.

1.  Introduction 

The aim of language modelling is to estimate a distribution over words
that best represents the text of a corpus. Language models are central to
tasks such as speech recognition, machine translation, and text
generation, and the parameters of these models are commonly used as
features or as initialization for other algorithms. Examples include the
word distributions learned by topic models, or the word embeddings learned
through neural language models.

Central to the language modelling problem is the challenge of scale. It is
typical for languages to have vocabularies of hundreds of thousands of
word types, and language models themselves are often estimated on corpora
with billions of tokens (Graff et al., 2003). The scale of the problem
inherently limits the types of distributions that can be effectively
applied.

In practice, the most commonly used class of language models is that of
n-gram models. These represent the probability of the next word as a
multinomial distribution conditioned on the previous context. The
parameters of this class of models can be very efficiently estimated by
simply collecting sufficient statistics and tuning a small set of
parameters.

More recently, neural language models (NLMs) have gained popularity
(Bengio et al., 2006; Mnih & Hinton, 2007). These models estimate the same
distribution as n-gram models, but utilize a non-linear neural network
parameterization. NLMs have been shown to produce results competitive with
n-gram models using far fewer parameters. Additionally, the parameters
themselves have proven to be useful for other language tasks (Collobert
et al., 2011). Unfortunately, training NLMs can be much slower than
training n-gram models, often requiring expensive gradient computations
for each token; techniques have been developed to speed up training in
practice (Mnih & Hinton, 2009; Mnih & Teh, 2012).

In this work, we consider a different class of language models. Instead of
estimating the local probability of the next word given its context, we
globally model the entire corpus as a Markov random field (MRF) language
model. Undirected graphical models like MRFs have been widely applied in
natural language processing as a way to flexibly model statistical
dependencies; however, MRFs are rarely used for language modelling since
estimating their parameters requires computing a costly partition
function.

Our contribution is to provide a simple-to-implement algorithm for very
efficiently estimating this class of models. We take a variational
approach to the optimization problem, and devise a lower bound on the
log-likelihood using lifted inference. By exploiting the problem's
symmetry, we derive an efficient approximation of the partition function.

Crucially, each step of the final algorithm has time complexity of
$O(KC^2)$, where $K$ is the size of the n-gram context and $C$ is the size
of the vocabulary. Note that besides collecting statistics, this algorithm
has no time dependence on the number of tokens, potentially allowing its
estimation speed to scale similarly to n-gram models.

Experimentally, we demonstrate the quality of the models learned by our
algorithm by applying it to a language modelling task. Additionally, we
show that this same estimation algorithm can be effectively applied to
other common sequence modelling tasks such as part-of-speech tagging.

2.  Background 

Notation We denote sequences by bold variables: $\mathbf{t} = (t_1,
\ldots, t_n)$. A sub-sequence will be defined as either $\mathbf{t}_i^j =
(t_i, t_{i+1}, \ldots, t_j)$ or $\mathbf{t}^{-i} = (t_1, \ldots, t_{i-1},
t_{i+1}, \ldots, t_n)$.

Contextual language models Let us first define the class of contextual
language models as the set of distributions over words conditioned on a
fixed-length left context. Formally this is an estimate of
$p(t_i \mid \mathbf{t}_{i-K}^{i-1})$, where $t_i$ is the current word and
$K$ is the size of the context window. For a basic n-gram language model,
this is simply a multinomial distribution, and the maximum-likelihood
estimate can be computed in closed form from the statistics of the corpus
(although in practice some smoothing is often employed).
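
The closed-form estimate mentioned above can be illustrated with a minimal
sketch. The function name `ngram_mle` and the toy corpus are hypothetical,
and no smoothing is applied, so this is only the raw maximum-likelihood
ratio of counts rather than a practical n-gram model.

```python
from collections import Counter

def ngram_mle(tokens, K):
    """Unsmoothed maximum-likelihood estimate of p(word | K previous words)."""
    context_counts = Counter()
    ngram_counts = Counter()
    for i in range(K, len(tokens)):
        context = tuple(tokens[i - K:i])
        context_counts[context] += 1
        ngram_counts[(context, tokens[i])] += 1
    # p(word | context) = count(context, word) / count(context)
    return {
        (context, word): count / context_counts[context]
        for (context, word), count in ngram_counts.items()
    }

# Toy usage: a bigram model (K = 1) on a six-token corpus.
probs = ngram_mle("the cat sat on the mat".split(), K=1)
```

Only one pass over the corpus is needed to gather the counts; everything
after that depends on the number of distinct n-grams, not on the number of
tokens.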

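A neural (probabilistic) language model (NLM) is a contextual language
model where the word probability is a non-linear function of the context
estimated from a neural network. In this work, we will focus specifically
on the class of NLMs with potentials that are bilinear in the context and
predicted word, such as the log-bilinear language model of Mnih & Hinton
(2007). This model is parametrized as:

$$p(t_i \mid \mathbf{t}_{i-K}^{i-1}) = \frac{\exp\left(\sum_{l=1}^{K} U_{t_{i-l}} R_l W_{t_i}^\top\right)}{Z(\mathbf{t}_{i-K}^{i-1})} \qquad (1)$$

where $U \in \mathbb{R}^{C \times D}$, $W \in \mathbb{R}^{C \times D}$ and
$R \in \mathbb{R}^{K \times D \times D}$ are the parameters of the model,
and $Z(\mathbf{t}_{i-K}^{i-1})$ is a local normalization function
(dependent on the context). Specifically, $U_{t_i}$ and $W_{t_i}$ are the
left and right embeddings respectively of token $t_i$, $R_l$ is a
distance-dependent transition matrix, $C$ is the size of the vocabulary
and $D \ll C$ is the size of the low-rank word embeddings. Another form of
bilinear model is the variant used in Word2Vec (Mikolov et al., 2013):

$$p(t_i \mid \mathbf{t}_{i-K}^{i-1}, \mathbf{t}_{i+1}^{i+K}) = \frac{\exp\left(\sum_{l=1}^{K} (U_{t_{i-l}} + U_{t_{i+l}}) W_{t_i}^\top\right)}{Z(\mathbf{t}_{i-K}^{i-1}, \mathbf{t}_{i+1}^{i+K})} \qquad (2)$$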


These models have been shown to give similar results to n-gram models
while providing useful word representations. However, the local
normalization means that most optimization methods need to look at one
token at a time, and scale at least linearly with the size of the corpus.


Markov random fields To avoid this issue of local normalization, we model
the entire corpus as a sequence of random variables $T_1 \ldots T_N$, for
which we give a joint, globally normalized distribution. We specify this
distribution with a Markov random field.

A Markov random field is defined by a graph structure $\mathcal{G} =
(\mathcal{V}, \mathcal{E})$, and a set of potentials
$(\theta^c)_{c \in \mathcal{C}}$, where $\mathcal{C}$ is defined as the set
of cliques in graph $\mathcal{G}$.

Let $\mathbf{t}$ denote a specific assignment of $T_1 \ldots T_N$, and let
$\mathbf{t}_c = (t_i)_{i \in c}$. The log-probability of a sequence
$\mathbf{t}$ is then:

$$\log p(\mathbf{t}; \theta) = \sum_{c \in \mathcal{C}} \theta^c_{\mathbf{t}_c} - A(\theta)$$

where $A(\theta)$ is called the log-partition function, and can in general
be computed exactly with a complexity exponential in the tree-width of
$\mathcal{G}$.
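
To make the role of the log-partition function concrete, the following
sketch computes it by brute-force enumeration for a tiny MRF. The helper
names (`log_partition`, `log_prob`) and the toy "agreement" potential are
assumptions made for illustration; enumeration is only feasible for a
handful of variables, which is precisely the difficulty discussed in the
rest of the paper.

```python
import itertools
import math

def log_partition(num_vars, num_states, cliques):
    """Brute-force A(theta): log of the sum over all joint assignments.

    `cliques` maps a tuple of variable indices to a function returning the
    log-potential of the corresponding sub-assignment.
    """
    scores = []
    for t in itertools.product(range(num_states), repeat=num_vars):
        scores.append(sum(f(tuple(t[i] for i in c)) for c, f in cliques.items()))
    m = max(scores)  # subtract the maximum for numerical stability
    return m + math.log(sum(math.exp(s - m) for s in scores))

def log_prob(t, num_states, cliques):
    """log p(t; theta) = sum of clique log-potentials minus A(theta)."""
    score = sum(f(tuple(t[i] for i in c)) for c, f in cliques.items())
    return score - log_partition(len(t), num_states, cliques)

# Toy chain of 3 binary variables whose edges reward equal neighbours.
agree = lambda tc: 1.0 if tc[0] == tc[1] else 0.0
cliques = {(0, 1): agree, (1, 2): agree}
total = sum(math.exp(log_prob(t, 2, cliques))
            for t in itertools.product(range(2), repeat=3))
```

The enumeration visits `num_states ** num_vars` assignments, so the cost
grows exponentially with the sequence length.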

3.  MRF  Language  Models 

Sentence model Building on this formalism, let us now define a family of
MRF distributions over text. We start by considering a sequence
$x = (x_1, \ldots, x_n)$ of $n$ variables with state space $\mathcal{X}$.
We define an order $K$ Markov sequence model as a Markov random field where
each element of the sequence is connected to its $K$ left and right
neighbours.

For simplicity of exposition, we restrict our description in the rest of
this paper to pairwise Markov sequence models, for which only cliques of
size 2 (edges) have potentials: $\theta = \{\theta^{i,j}, (i, j) \in
\mathcal{E}\}$. The lifted inference method, however, can be easily
extended to higher-order potentials.


Figure 1. The sentence distribution model for M = 4.

Following the notation introduced in Section 2, this gives us the following
distribution over $\mathcal{X}^n$:

$$\forall x \in \mathcal{X}^n, \quad \log p^K_{\mathrm{seq}}(x) = \sum_{i=1}^{n-K} \sum_{l=1}^{K} \theta^{i,i+l}_{x_i, x_{i+l}} - A(\theta)$$


Let $\mathcal{T}$ denote the vocabulary of our text corpus. In this work,
we define the context of a word as its $K$ left and right neighbouring
tokens. By adding $K$ "padding" or "separator" tokens ⟨S⟩ $\notin
\mathcal{T}$ to the left and right of the sentence, this notion of context
also allows us to bias the distribution of tokens at the beginning and end
of the sentence.

Let $\mathcal{X} = \mathcal{T} \cup \{$⟨S⟩$\}$. A sentence $\mathbf{t}$ of
length $M$ is then implicitly mapped to a sequence $x(\mathbf{t}) \in
\mathcal{X}^{M+2K}$ by adding the start and end ⟨S⟩ tokens. Now, letting
$\mathcal{S} = \{x(\mathbf{t}) \mid \mathbf{t} \in \mathcal{T}^M\}$, the
order $K$ Markov sequence model allows us to define the following
distribution over sentences of length $M$, as illustrated in Figure 1:

$$p^K_M(\mathbf{t}) = p^K_{\mathrm{seq}}(x = x(\mathbf{t}) \mid x \in \mathcal{S}) \qquad (3)$$


This gives rise to the following generative model for a sentence:

1. Sample the sentence length $M \sim t(M)$,

2. Sample $M$ tokens: $(t_1, \ldots, t_M) \sim p^K_M$,

where $t$ is any distribution over integers and can easily be fit to the
data. We focus in this work on learning the parameters of $p^K_M$.

These graphical models define a large family of log-linear distributions,
depending on the value of $K$ and the parametrization of the edge
log-potentials $\theta$. In the applications that follow, $\theta$ will
either be defined (and optimized over) explicitly, or represented as a
product of low-rank matrices. We will show how to optimize the likelihood
of the corpus for both these settings.


Low rank Markov random fields We now consider different low-rank
realizations of the log-potentials $\theta$. Suppose that $\theta^{i,j}$
only depends on $|j - i|$, that is to say, parameters are shared across
edges of the same length (in which case we shall simply write
$\theta^{i,j} = \theta^{|j-i|}$). We can have for example:

$$\theta^l = U R_l W^\top \qquad (4)$$

$$\theta^l = U_l W^\top \qquad (5)$$

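One interesting property of these models is that since the Markov blanket
of a word consists only of its immediate neighbours, its conditional
likelihood can be expressed as:

$$p(t_i \mid \mathbf{t}^{-i}) = p(t_i \mid \mathbf{t}_{i-K}^{i-1}, \mathbf{t}_{i+1}^{i+K}) \propto \exp\left(\sum_{l=1}^{K} \theta^l_{t_{i-l}, t_i} + \theta^l_{t_i, t_{i+l}}\right)$$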

This class of probability functions corresponds to those defined by a
bi-directional log-bilinear neural language model. The model in Equation 4
($\theta = URW$) can easily be rewritten in terms of the bi-directional
version of Mnih's LBL (Mnih & Teh, 2012), and the model in Equation 5,
which we use in the rest of this work, is a slightly more general
(distance-dependent) version of the Word2Vec CBOW model of Mikolov et al.
(2013).

Conversely, log-bilinear NLMs can be seen as optimizing the
pseudo-likelihood (defined as $\prod_{i=1}^{n} p(t_i \mid
\mathbf{t}^{-i})$) of an order $K$ Markov sequence model as defined above.
Since the pseudo-likelihood is a consistent estimator of the likelihood
(Besag, 1975), we expect our factorization to have properties similar to
those of the embeddings learned by log-bilinear neural language models.
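
As an illustration of the conditional likelihood and of the
pseudo-likelihood it induces, here is a minimal sketch for a pairwise model
with shared potentials. It assumes the factorization $\theta^l = U_l
W^\top$, integer token ids with id 0 standing in for the separator, and a
sequence already padded with $K$ separators on each side; the function
names and toy parameters are hypothetical.

```python
import math

def low_rank_potential(U_l, W):
    """theta^l[u][v] = sum_d U_l[u][d] * W[v][d], i.e. theta^l = U_l W^T."""
    return [[sum(a * b for a, b in zip(u_row, w_row)) for w_row in W]
            for u_row in U_l]

def conditional(x, i, thetas):
    """p(x_i = w | all other tokens) for an order-K pairwise model.

    `thetas[l - 1]` is the C x C matrix theta^l; `x` is assumed to be padded
    so that positions i - K .. i + K all exist.
    """
    C = len(thetas[0])
    scores = []
    for w in range(C):
        s = 0.0
        for l, theta in enumerate(thetas, start=1):
            s += theta[x[i - l]][w] + theta[w][x[i + l]]
        scores.append(s)
    m = max(scores)
    z = sum(math.exp(s - m) for s in scores)  # local normalization over C words
    return [math.exp(s - m) / z for s in scores]

def log_pseudo_likelihood(x, K, thetas):
    """Sum of log p(x_i | x^{-i}) over the non-padding positions."""
    return sum(math.log(conditional(x, i, thetas)[x[i]])
               for i in range(K, len(x) - K))

# Toy setup: C = 3 symbols, K = 1, D = 2.
U_1 = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
W = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
thetas = [low_rank_potential(U_1, W)]
x = [0, 1, 2, 1, 0]
dist = conditional(x, 2, thetas)
lpl = log_pseudo_likelihood(x, 1, thetas)
```

Each conditional requires a sum over the vocabulary at every token, which
is why pseudo-likelihood training scales at least linearly with the corpus.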

4.  Efficient  Learning  Using  Lifted  Variational 
Inference 

We now outline our method for optimizing the likelihood of a corpus under
our class of models. Learning undirected graphical models is challenging
because of the global normalization constant, or partition function. We
derive a tractable algorithm by using a variational approximation: we
define a lower bound on the data likelihood (Wainwright et al., 2005;
Yanover et al., 2008), and alternate between finding the tightest version
of that bound and taking a gradient ascent step in the parameters of the
model.

The novelty of our method comes from the fact that for the bound we
define, both the tightening and gradient step only require us to consider
$K$ pairwise moments, i.e. the running time of learning will be
independent of the size of the corpus. We achieve this by showing how to
reduce the learning task to lifted variational inference, allowing us to
build upon recent work by Bui et al. (2014). We then derive an algorithm to
efficiently perform lifted variational inference using belief propagation
and dual decomposition. The overall learning algorithm is simple to
implement and runs very fast.

4.1.  Creating  Symmetry  using  a  Cyclic  Model 

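Given a corpus $\mathbf{t}^c = (\mathbf{t}^1, \ldots, \mathbf{t}^{n_c})$
of $n_c$ sentences drawn independently from our model, we wish to maximize
its likelihood $p(\mathbf{t}^c) = \prod_{i=1}^{n_c} p(\mathbf{t}^i) =
\prod_{i=1}^{n_c} p^K_{M_i}(\mathbf{t}^i)$, where $M_i$ is the length of
sentence $\mathbf{t}^i$. We first show how to obtain a symmetric lower
bound on the likelihood $\prod_{i=1}^{n_c} p^K_{M_i}(\mathbf{t}^i)$.

Consider the sequence $x(\mathbf{t}^c) \in \mathcal{X}^N$ obtained by
adding $K$ ⟨S⟩ tokens before the first and between any two adjacent
sentences in $\mathbf{t}^c$, where $N = \sum_{i=1}^{n_c} (M_i + K)$. Let
$p^K_{\mathrm{cycl}}$ be the wrapped around version of
$p^K_{\mathrm{seq}}$, as illustrated in Figure 2. We have the following
result (the proof can be found in the supplementary materials):

Lemma 1. Let $\mathcal{S} = \{x(\mathbf{t}^c) \mid \mathbf{t}^c \in
\mathcal{T}^{M_1} \times \ldots \times \mathcal{T}^{M_{n_c}}\}$. Then,

$$\prod_{i=1}^{n_c} p^K_{M_i}(\mathbf{t}^i) = p^K_{\mathrm{cycl}}(x = x(\mathbf{t}^c) \mid x \in \mathcal{S})$$

Figure 2. The cyclic model with N = 16 and a setting $s \in \mathcal{S}$
corresponding to a corpus of two sentences.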
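
The construction of $x(\mathbf{t}^c)$ described above can be sketched in a
few lines. The separator string `"<S>"` and the helper names are
hypothetical choices for this illustration; the wrap-around indexing is
what turns the padded sequence into the cyclic model.

```python
SEP = "<S>"

def cyclic_sequence(sentences, K):
    """Concatenate sentences, inserting K separators before each one."""
    x = []
    for sentence in sentences:
        x.extend([SEP] * K)
        x.extend(sentence)
    return x

def cyclic_neighbour(x, i, l):
    """Token at distance l to the right of position i, wrapping around."""
    return x[(i + l) % len(x)]

# Toy corpus of two sentences with K = 2, so N = (3 + 2) + (2 + 2) = 9.
corpus = [["the", "cat", "sat"], ["dogs", "bark"]]
x = cyclic_sequence(corpus, K=2)
```

Because of the wrap-around, the separators placed before the first
sentence also serve as the right-hand padding of the last one.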


Hence, the cyclic model $p^K_{\mathrm{cycl}}(x(\mathbf{t}^c))$ provides a
lower bound on $p(\mathbf{t}^c)$, which happens to be invariant to
rotations; the rest of the paper makes use of the symmetry to maximize this
lower bound. Let $\theta^0$ denote single node unary potentials, and
$\theta^l$ the edge potentials as defined in Section 3. The objective we
want to optimize is:

$$\log p^K_{\mathrm{cycl}}(x(\mathbf{t}^c)) = \sum_{i=1}^{N} \left( \theta^0_{x_i} + \sum_{l=1}^{K} \theta^l_{x_i, x_{i+l}} \right) - A(\theta),$$

where

$$A(\theta) = \log\left( \sum_{y \in \mathcal{X}^N} \exp\left( \sum_{i=1}^{N} \left( \theta^0_{y_i} + \sum_{l=1}^{K} \theta^l_{y_i, y_{i+l}} \right) \right) \right),$$

and $\forall l \in \{1, \ldots, K\}$, $x_{N+l} = x_l$ and $y_{N+l} = y_l$.

4.2. Variational lower bound

Unfortunately, the partition function $A$ is extremely costly to compute
for any reasonable vocabulary size, as exact dynamic programming has a
running time exponential in the context size $K$. However, it is easy to
formulate upper bounds on $A$, which give rise to a family of lower bounds
on the log-likelihood. We start by using an equivalent variational
formulation of the partition function as an optimization problem:

$$A(\theta) = \max_{\mu \in \mathcal{M}} \sum_{i=1}^{N} \left( \langle \mu^i, \theta^0 \rangle + \sum_{l=1}^{K} \langle \mu^{i,i+l}, \theta^l \rangle \right) + H(\mu),$$

where $\mathcal{M}$ denotes the marginal polytope (Wainwright & Jordan,
2008). We then make two approximations to make solving this optimization
problem easier. First, we replace $\mathcal{M}$ with the local consistency
polytope $\mathcal{LC}$. Since $\mathcal{LC} \supseteq \mathcal{M}$, this
gives us an upper bound on the original solution. Second, we replace the
entropy $H(\mu)$ with the tree-reweighted (TRW) upper bound (Wainwright et
al., 2005):

$$H(\mu) \leq \bar{H}(\mu; \rho) = \sum_{i=1}^{N} \left( H(\mu^i) - \sum_{l=1}^{K} \rho_{i,i+l}\, I(\mu^{i,i+l}) \right)$$

where $\rho_{ij}$ denotes the probability of edge $(i, j)$ appearing in a
covering set of forests for the MRF. Let:

$$\bar{A}(\theta; \rho) = \max_{\mu \in \mathcal{LC}} \sum_{i=1}^{N} \left( \langle \mu^i, \theta^0 \rangle + \sum_{l=1}^{K} \langle \mu^{i,i+l}, \theta^l \rangle \right) + \bar{H}(\mu; \rho)$$

Using this variational approximation, we now have an upper bound on the
log-partition function which can be computed by solving a convex
optimization problem. Altogether this then gives us the following
tractable lower bound on the log-likelihood:

$$\log p^K_{\mathrm{cycl}}(x) \geq \sum_{i=1}^{N} \left( \theta^0_{x_i} + \sum_{l=1}^{K} \theta^l_{x_i, x_{i+l}} \right) - \bar{A}(\theta; \rho) = \mathcal{L}(\theta, x; \rho). \qquad (6)$$

Learning using gradient ascent then requires that we compute the
derivative of $\bar{A}(\theta; \rho)$, which we will show is the $\mu$
that maximizes the variational optimization problem (we return to this
process in more detail in the next section). We can therefore reduce the
learning task to that of repeatedly performing approximate inference using
TRW. Fast combinatorial solvers for TRW exist, including tree-reweighted
belief propagation (Wainwright et al., 2005), convergent message-passing
based on geometric programming (Globerson & Jaakkola, 2007), and dual
decomposition (Jancsary & Matz, 2011), which all have complexity linear in
the size of the corpus.

However, we next show that by taking advantage of the symmetries present
in the optimization problem, it is possible to solve it in time which is
independent of $N$, the number of word tokens in the corpus.

4.3. Lifting the objective

Our key insight is that because of the parameter sharing in our model, the
random variables in the cyclic model are all indistinguishable. More
precisely, there is an automorphism group of rotations which can be
applied to the sufficient statistic vector and to the model parameters
without changing the joint distribution (Bui et al., 2013).

When such symmetry exists, Bui et al. (2014) show that without loss of
generality one can choose the edge appearance probabilities to be
symmetric, which in our setting corresponds to choosing a $\rho$ such that
$\forall i, j$, $\rho_{ij} = \rho_{|j-i|}$ (i.e., the tightest TRW upper
bound on $A(\theta)$ can be obtained by a symmetric $\rho$). When the edge
appearance probabilities are chosen accordingly, since the objective is
strictly concave and the variables are rotationally symmetric, it follows
(Bui et al., 2014, Theorem 3) that the optimum must satisfy the following
property:

$$\forall\, 1 \leq i, j \leq N,\ 1 \leq l \leq K, \quad \mu^i = \mu^j \ \text{ and } \ \mu^{i,i+l} = \mu^{j,j+l}. \qquad (7)$$

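We will take advantage of this structural property to dramatically
simplify the variational optimization problem. In particular, using the
notation $\mu_V$ to refer to the single-node marginal (there is only one)
and $\mu^l_E(x_1, x_2)$ to refer to the edge marginal corresponding to the
potential $\theta^l$, we have:

$$\bar{A}(\theta) = \max_{\mu} N \left[ \langle \mu_V, \theta^0 \rangle + H(\mu_V) + \sum_{l=1}^{K} \left( \langle \mu^l_E, \theta^l \rangle - \rho_l\, I(\mu^l_E) \right) \right] \qquad (8)$$

where the maximization is subject to the non-negativity constraints
$\mu_V, \mu^l_E \geq 0$, sum-to-one constraints $\sum_{x_1} \mu_V(x_1) = 1$
and $\forall l$, $\sum_{x_1, x_2} \mu^l_E(x_1, x_2) = 1$, and pairwise
consistency constraints:

$$\sum_{x_1} \mu^l_E(x_1, x_2) = \mu_V(x_2) \quad \forall l, x_2, \qquad (9)$$

$$\sum_{x_2} \mu^l_E(x_1, x_2) = \mu_V(x_1) \quad \forall l, x_1. \qquad (10)$$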


The optimal solution is guaranteed to be symmetric, and so we could have
used a slightly more compact form of the optimization problem (cf. Bui et
al., 2014). However, we prefer this form both because it is easier to
describe and because it is more amenable to solving efficiently.
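
To make the lifted objective of Eq. 8 concrete, the sketch below evaluates
the bracketed expression at a given feasible point $\mu$; maximizing it
over $\mu$ would give the bound itself. It assumes that the edge term takes
the usual TRW form $\langle \mu^l_E, \theta^l \rangle - \rho_l I(\mu^l_E)$
with symmetric edge appearance probabilities $\rho_l$; the function names
are hypothetical.

```python
import math

def entropy(p):
    """Shannon entropy H(p) of a discrete distribution."""
    return -sum(q * math.log(q) for q in p if q > 0)

def mutual_information(joint):
    """I(mu_E) between the two endpoints of an edge marginal."""
    rows = [sum(r) for r in joint]
    cols = [sum(c) for c in zip(*joint)]
    return sum(q * math.log(q / (rows[a] * cols[b]))
               for a, r in enumerate(joint) for b, q in enumerate(r) if q > 0)

def lifted_objective(N, theta0, thetas, mu_v, mu_e, rho):
    """N * [<mu_V, theta0> + H(mu_V) + sum_l (<mu_E^l, theta^l> - rho_l I(mu_E^l))]."""
    value = sum(m * t for m, t in zip(mu_v, theta0)) + entropy(mu_v)
    for theta, mu, rho_l in zip(thetas, mu_e, rho):
        inner = sum(mu[a][b] * theta[a][b]
                    for a in range(len(mu)) for b in range(len(mu)))
        value += inner - rho_l * mutual_information(mu)
    return N * value

# With all potentials at zero, uniform marginals give N * log(C), which is
# the exact log-partition function of the cyclic model in that case.
C, K, N = 3, 2, 6
theta0 = [0.0] * C
thetas = [[[0.0] * C for _ in range(C)] for _ in range(K)]
mu_v = [1.0 / C] * C
mu_e = [[[1.0 / (C * C)] * C for _ in range(C)] for _ in range(K)]
value = lifted_objective(N, theta0, thetas, mu_v, mu_e, [1.0 / (K + 1)] * K)
```

The cost of one evaluation is proportional to $C + KC^2$ and does not
depend on $N$ beyond the final multiplication.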

The lifted problem, Eq. 8, has only $C + KC^2$ optimization variables,
instead of the $N(C + KC^2)$ of the original objective. However, it remains
to figure out how to solve this optimization problem. Bui et al. (2014)
solve the lifted TRW problem using Frank-Wolfe, which has to repeatedly
solve a linear program over the same feasible space (i.e., Eqs. 9 and 10).
These linear programs would be huge in our setting, where $C$ can be as
large as 10,000, leading to prohibitive running times.


4.4.  Dual  Decomposition 

We now derive an efficient algorithm based on dual decomposition to
optimize our lifted TRW objective. We will have an upper bound on the
log-partition function, and thus a lower bound on the likelihood, for any
valid edge appearance probabilities. However, our algorithm requires a
specific choice for all edges: $\forall l$, $\rho_l = \frac{1}{K+1}$.

We assume that the corpus length $N$ is a multiple of $K + 1$, which can
always be achieved by adding "filler" ⟨S⟩ tokens. To prove that our choice
of $\rho$ defines valid edge appearance probabilities, we demonstrate a set
of $K + 1$ forests $T$ such that $\rho_{ij} = \sum_{T : (i,j) \in T}
\rho_T$ where for all $T$, $\rho_T = \frac{1}{K+1}$. In particular, we take
forests which are made up of disconnected stars, rotated so that each edge
is covered exactly once. Figure 3 illustrates this choice of forests.

Figure 3. The set of K + 1 covering forests used for K = 3, N = 16. Each
edge is represented in exactly one forest.
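
One way to realize such a covering set of forests is sketched below. The
specific layout (forest $r$ contains stars centred at positions $i \equiv r
\pmod{K+1}$, each connected to its $K$ right neighbours) is an assumption
made for this illustration and may differ from the layout drawn in Figure
3; the checks only confirm that it is a valid instance of the construction.

```python
def star_forests(N, K):
    """Return K + 1 forests; forest r holds stars centred at i = r mod (K + 1)."""
    assert N % (K + 1) == 0, "pad the corpus so that N is a multiple of K + 1"
    forests = []
    for r in range(K + 1):
        edges = [(i, (i + l) % N)
                 for i in range(r, N, K + 1)
                 for l in range(1, K + 1)]
        forests.append(edges)
    return forests

def is_forest(N, edges):
    """Union-find check that the edge set contains no cycle."""
    parent = list(range(N))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path halving
            a = parent[a]
        return a

    for a, b in edges:
        ra, rb = find(a), find(b)
        if ra == rb:
            return False
        parent[ra] = rb
    return True

N, K = 16, 3
forests = star_forests(N, K)
all_edges = [e for f in forests for e in f]
cyclic_edges = {(i, (i + l) % N) for i in range(N) for l in range(1, K + 1)}
```

Since every edge appears in exactly one of the $K + 1$ equally weighted
forests, each edge has appearance probability $\frac{1}{K+1}$.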


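Using this, we can rewrite the objective in Eq. 8 as:

$$\bar{A}(\theta) = \frac{N}{K+1} \max_{\mu} \left\{ (K+1) \left[ \langle \mu_V, \theta^0 \rangle + H(\mu_V) \right] + \sum_{l=1}^{K} \left( \langle \mu^l_E, (K+1)\theta^l \rangle - I(\mu^l_E) \right) \right\} \qquad (11)$$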


Finally, rather than optimizing over Eq. 11 explicitly, we re-write it in
a form in which we can use a belief propagation algorithm to perform part
of the maximization. To do so, we introduce redundant variables $\mu^l_V$
for $l \in [1, K]$, enforce that they are equal to $\mu^0_V$ (the original
$\mu_V$) and use them instead of $\mu_V$ for each pairwise consistency
constraint. The resulting equivalent form of the optimization problem is:

$$\bar{A}(\theta) = \frac{N}{K+1} \max_{\mu} \left\{ \sum_{l=0}^{K} \langle \mu^l_V, \theta^0 \rangle + \sum_{l=1}^{K} \langle \mu^l_E, (K+1)\theta^l \rangle + \sum_{l=0}^{K} H(\mu^l_V) - \sum_{l=1}^{K} I(\mu^l_E) \right\}, \qquad (12)$$

subject to non-negativity and sum-to-1 constraints, and:

$$\forall l \in [1, K], x_2, \quad \sum_{x_1} \mu^l_E(x_1, x_2) = \mu^l_V(x_2)$$

$$\forall l \in [1, K], x_1, \quad \sum_{x_2} \mu^l_E(x_1, x_2) = \mu^0_V(x_1)$$

$$\forall l \in [1, K], x_1, \quad \mu^l_V(x_1) = \mu^0_V(x_1). \qquad (13)$$


If one ignores the equality constraints (13), we see that the constrained
optimization problem in (12) exactly corresponds to a Bethe variational
problem for the tree-structured MRF shown in Figure 4. As a result, it
could be maximized in linear time using belief propagation (Wainwright &
Jordan, 2008, Theorem 4.2b, pg. 83-84).

Figure 4. The tree corresponding to the maximization sub-problem in the
lifted inference, for K = 3.

Our next step is to introduce these constraints in a way that still allows
for efficient optimization. This can be achieved through the use of
Lagrangian duality: by formulating the right dual problem, we obtain a
tight bound on our objective which can still be maximized through message
passing.

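We introduce Lagrange multipliers $\delta^l_V(x_1)$ for each equality
constraint in (13) and form the Lagrangian by adding $\sum_{l=1}^{K}
\sum_{x_1} \delta^l_V(x_1) \left( \mu^l_V(x_1) - \mu^0_V(x_1) \right)$ to
the objective (12). Re-arranging terms and omitting the constant, we obtain
that the dual objective is:

$$O^\delta(\theta, \mu) = \left\langle \mu^0_V, \theta^0 - \sum_{l=1}^{K} \delta^l_V \right\rangle + \sum_{l=1}^{K} \langle \mu^l_V, \theta^0 + \delta^l_V \rangle + \sum_{l=1}^{K} \langle \mu^l_E, (K+1)\theta^l \rangle + \sum_{l=0}^{K} H(\mu^l_V) - \sum_{l=1}^{K} I(\mu^l_E) \qquad (14)$$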

Since the primal problem is concave and strictly feasible (it is feasible
with no inequality constraints), Slater's conditions are met and we have
strong duality. Thus,

$$A(\theta) \leq \bar{A}(\theta) = \frac{N}{K+1} \min_{\delta} \max_{\mu} O^\delta(\theta, \mu). \qquad (15)$$

One useful property of the above is that we have a valid upper bound on
$A(\theta)$, the log-partition function of the circular model, for any
choice of the dual variables $\delta$. For a fixed $\delta$, computing the
upper bound simply requires one pass of belief propagation in the tree MRF
shown in Figure 4, for a running time of $O(KC^2)$.

5.  Learning  Algorithm 


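Recall that our goal is to estimate parameters $\theta$ to maximize
$\mathcal{L}(\theta, x; \rho)$ given in Eq. 6. Letting $\mu_x$ denote the
observed moments of the corpus $x(\mathbf{t}^c)$, we have for $\rho =
\frac{1}{K+1}$:

$$\mathcal{L}(\theta, x; \rho) = N \times \langle \mu_x, \theta \rangle - \bar{A}(\theta)$$

$$= N \times \langle \mu_x, \theta \rangle - N \times \frac{\min_{\delta} \max_{\mu \in \mathcal{LC}} O^\delta(\theta, \mu)}{K+1}$$

$$= N \times \max_{\delta} \left( \langle \mu_x, \theta \rangle - \frac{\max_{\mu \in \mathcal{LC}} O^\delta(\theta, \mu)}{K+1} \right)$$

$$= N \times \max_{\delta} \mathcal{L}(\theta, x; \delta),$$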


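Algorithm 1 Tightening the bound
input: model parameters $\theta$
repeat
    compute $\hat{\theta}(\theta, \delta)$ for the lifted MRF {Eq. (17)}
    compute $\mu(\hat{\theta})$ {BP on Fig. 4 MRF}
    compute $\nabla\delta(\theta)$ {Eq. (16)}
    take sub-gradient step: $\delta \leftarrow \delta - \alpha \nabla\delta$
until $\mu$ satisfies primal constraints {Eq. (13)}
output: $\mathcal{L}(\theta, x; \rho)$, pseudo-marginals $\mu$


Algorithm 2 Gradient ascent
input: data $x(\mathbf{t}^c)$, precision $\epsilon$, initial $U$, $W$
collect pairwise moments $(\mu_{u,v})_{(u,v) \in \mathcal{X}^2}$ from the data
repeat
    compute $\theta(U, W)$
    compute bound $\mathcal{L}(\theta, x; \rho) = \max_{\delta} \mathcal{L}(\theta, x; \delta)$ {Alg. 1}
    compute $\nabla_\theta \mathcal{L}(\theta, x; \rho)$ {Eq. (18)}
    compute $\nabla_U \mathcal{L}$ and $\nabla_W \mathcal{L}$ {Eq. (19)}
    take gradient step $U^{new} \leftarrow U + \nabla_U \mathcal{L}$
    take gradient step $W^{new} \leftarrow W + \nabla_W \mathcal{L}$
until convergence: $|\mathcal{L}^{new} - \mathcal{L}| < \epsilon$
output: estimated parameters $U^{new}$, $W^{new}$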


where $\mathcal{L}(\theta, x; \delta) = \langle \mu_x, \theta \rangle -
\frac{\max_{\mu \in \mathcal{LC}} O^\delta(\theta, \mu)}{K+1}$. Hence, for
any $\delta$, $N \times \mathcal{L}(\theta, x; \delta)$ defines a lower
bound over the log-likelihood of $x(\mathbf{t}^c)$, which can be made
tighter by optimizing over $\delta$. Moreover, $\mathcal{L}(\theta, x;
\delta)$ is jointly concave in $\delta$ and $\theta$. The learning
algorithm consists of alternating between tightening this bound (Algorithm
1), and taking gradient steps in $\theta$ (Algorithm 2), in an approach
similar to that of Meshi et al. (2010) and Hazan & Urtasun (2010).

Tightening the bound: For a fixed value of the parameters $\theta$, the
tightest bound is obtained for $\delta^* = \arg\min_{\delta} \max_{\mu \in
\mathcal{LC}} O^\delta(\theta, \mu)$. We can find this minimizer through a
sub-gradient descent algorithm. In particular, letting $\mu^*$ be a
maximizer of $O^\delta(\theta, \mu)$, the following is a sub-gradient of
$\max_{\mu} O^\delta(\theta, \mu)$ in $\delta$:

$$\nabla \delta^l_V = \mu^{l*}_V - \mu^{0*}_V \quad \forall l \in [1, K]. \qquad (16)$$

The optimal $\mu^*$ corresponds to the single node and edge marginals of
the tree-structured MRF given in Figure 4, which can be computed by running
belief propagation with the following log-potentials:

$$\hat{\theta}^0_V = \theta^0 - \sum_{l=1}^{K} \delta^l_V, \qquad \hat{\theta}^l_V = \theta^0 + \delta^l_V \quad \forall l \in [1, K],$$

$$\hat{\theta}^l_E = (K+1)\theta^l \quad \forall l \in [1, K]. \qquad (17)$$
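
A minimal sketch of this tightening loop is given below, under the same
assumptions as the log-potentials above: the tree is a star with the copy
$\mu^0_V$ at its centre, and the sub-gradient in $\delta^l_V$ is $\mu^l_V
- \mu^0_V$. It uses a fixed step size `alpha` rather than the decaying
schedule of the experiments, computes the marginals of the star directly
instead of calling a general belief propagation routine, and omits the
numerical safeguards a real implementation would need. All names are
hypothetical.

```python
import math

def star_marginals(theta0, thetas, deltas):
    """Exact node marginals and log Z of the star tree (centre = copy 0)."""
    C, K = len(theta0), len(thetas)
    # leaf_pot[l][a][b]: unnormalised weight of (centre = a, leaf l = b)
    leaf_pot = [[[math.exp((K + 1) * th[a][b] + theta0[b] + d[b])
                  for b in range(C)] for a in range(C)]
                for th, d in zip(thetas, deltas)]
    msg = [[sum(row) for row in pot] for pot in leaf_pot]  # msg[l][a]
    centre = [math.exp(theta0[a] - sum(d[a] for d in deltas))
              * math.prod(m[a] for m in msg) for a in range(C)]
    Z = sum(centre)
    mu0 = [c / Z for c in centre]
    leaves = [[sum(mu0[a] * pot[a][b] / m[a] for a in range(C))
               for b in range(C)] for pot, m in zip(leaf_pot, msg)]
    return mu0, leaves, math.log(Z)

def tighten(theta0, thetas, steps=200, alpha=0.1):
    """Sub-gradient descent on delta; returns delta and the dual values."""
    C, K = len(theta0), len(thetas)
    deltas = [[0.0] * C for _ in range(K)]
    history = []
    for _ in range(steps):
        mu0, leaves, log_z = star_marginals(theta0, thetas, deltas)
        history.append(log_z)
        for d, mu_l in zip(deltas, leaves):
            for a in range(C):
                # sub-gradient of the dual in delta^l is mu^l_V - mu^0_V
                d[a] -= alpha * (mu_l[a] - mu0[a])
    return deltas, history

# Toy model with K = 1 and C = 2.
theta0 = [0.3, -0.2]
thetas = [[[0.5, -0.4], [0.1, 0.7]]]
deltas, history = tighten(theta0, thetas)
mu0, leaves, _ = star_marginals(theta0, thetas, deltas)
```

Each iteration costs $O(KC^2)$, and the loop stops changing $\delta$ once
the leaf marginals agree with the centre marginal, which is the primal
constraint of Eq. (13).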


Gradient Ascent: The marginals computed at $\delta^*$ can then be used to
compute gradients for our main objective. Recall that our aim is to
maximize the objective function $\mathcal{L}(\theta, x; \rho) = N \times
\mathcal{L}(\theta, x; \delta^*(\theta))$, where $\delta^*(\theta)$ is the
output of Algorithm 1. For any value of $\delta$, even before optimality,
we have:

$$\nabla_\theta \max_{\mu \in \mathcal{LC}} O^\delta(\theta, \mu) = (K+1) \arg\max_{\mu \in \mathcal{LC}} O^\delta(\theta, \mu).$$

Hence:

$$\nabla_\theta \mathcal{L}(\theta, x; \rho) = N \times \left( \mu_x - \arg\max_{\mu \in \mathcal{LC}} O^{\delta^*}(\theta, \mu) \right). \qquad (18)$$

For the low-rank MRFs, the gradients in the parameters can then be
obtained using the chain rule. For the factorization of $\theta$ presented
in Eq. 5, we get for $u, v \in \mathcal{X}$, $d \in [1, D]$, $l \in [1,
K]$:

$$\frac{\partial}{\partial U^l_{u,d}} \mathcal{L}(\theta, x; \rho) = \sum_{u' \in \mathcal{X}} W_{u',d} \times \nabla_{\theta^l_{u,u'}} \mathcal{L}(\theta, x; \rho) \qquad (19)$$

These can be used to perform gradient ascent on the objective function, as
outlined in Algorithm 2.
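
The only step of the learning procedure that touches the corpus is the
collection of the observed moments $\mu_x$. A minimal sketch of that step,
assuming integer token ids and cyclic wrap-around as in Section 4.1, is
given below; the function names are hypothetical.

```python
from collections import Counter

def collect_moments(x, K):
    """Empirical unary and pairwise moments of a cyclic token sequence."""
    N = len(x)
    unary = Counter(x)
    pairwise = [Counter((x[i], x[(i + l) % N]) for i in range(N))
                for l in range(1, K + 1)]
    mu0 = {u: c / N for u, c in unary.items()}
    mu = [{uv: c / N for uv, c in counts.items()} for counts in pairwise]
    return mu0, mu

def data_term(mu0, mu, theta0, thetas):
    """<mu_x, theta>: the per-token data term of the objective."""
    value = sum(p * theta0[u] for u, p in mu0.items())
    for mu_l, theta_l in zip(mu, thetas):
        value += sum(p * theta_l[u][v] for (u, v), p in mu_l.items())
    return value

x = [0, 1, 2, 1, 0, 1]      # token ids; 0 plays the role of the separator
mu0, mu = collect_moments(x, K=2)
```

Once `mu0` and `mu` are stored, the data term and its gradient in
$\theta$ can be evaluated without revisiting the tokens.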


6.  Experiments 

We conducted experiments using the lifted algorithm to examine its
practical efficiency, effectiveness at estimating gradients, and the
properties of the tree re-weighted bound. We implemented models for two
standard natural language tasks: language modelling and part-of-speech
tagging.

Setup For language modelling we ran experiments on the Penn Treebank (PTB)
corpus with the standard language modelling setup: sections 0-20 for
training (N = 930k), sections 21-22 for validation (N = 74k) and sections
23-24 (N = 82k) for test. For this dataset the vocabulary size is C = 10k,
and rare words are replaced with UNK.

For part-of-speech tagging we use the tagged version of the Penn Treebank
corpus (Marcus et al., 1993). We use sections 2-21 for training, section 22
for validation and section 23 for test. For this corpus the tag size is
T = 36 and we use the full vocabulary size with C ~ 30k.

For model parameter optimization (the gradient step in Algorithm 2) we use
L-BFGS (Liu & Nocedal, 1989) with backtracking line-search. For tightening
the bound (Algorithm 1), we used 200 sub-gradient iterations, each
requiring a round of belief propagation. Our sub-gradient rate parameter a
was set as a = 10^/2* where t is the number of preceding iterations where
the dual objective did not decrease. Our implementation of the algorithm
uses the Torch numerical framework (http://torch.ch/) and runs on the GPU
for efficiency.


Figure  5.  Comparison  of  a  model  trained  by  optimizing  exact 
likelihood  (green)  versus  the  lifted  TRW  objective  (red).  The  blue 
line  shows  the  exact  log-likelihood  of  the  red  model  as  it  is  being 
optimized  based  on  the  lifted  TRW  bound. 


Figure 6. The red and blue lines give lower bounds on the log-likelihood
(lifted objective). The green line shows the fixed value of the validation
log-likelihood of an LBL model trained on PTB.

Experiments First, to confirm the properties of the algorithm, we ran
experiments on a small synthetic data set with N = 12, K = 1 and C = 4. The
small size of this data set allows us to exactly compute the log-partition
for the original conditional model (Equation 3).

Figure 5 shows a comparison of a model trained using the exact gradients
on the conditional likelihood to a model trained by gradient ascent with
the lifted TRW objective. As expected, the latter gives an underestimate of
the log-likelihood, but the learned parameters yield an exact
log-likelihood close to the model learned with exact gradients.

Next we applied the lifted algorithm to a language modelling task on PTB.
We trained both the explicit full-rank model and the model with low-rank
log-potentials from Section 3, $\theta^l = U_l W^\top$, for D = 30 and
K = 2.

The results are presented in Figure 6. The lower bound on the likelihood
given by our algorithm is only slightly lower than the exact
log-likelihood computed for a left-context LBL model with K = 2. We also
note that the explicit model is prone to over-fitting, and gets to a worse
validation objective.

Another advantage of using low-rank potentials is that they produce embedded representations of the vocabulary. Table 1 shows a sample of embeddings learned for the MRF compared to those obtained with the Word2Vec algorithm (with D = 100 and a window size of 4, training run for 5 epochs). We also tried training our model by performing stochastic gradient descent on the pseudo-likelihood of the corpus under our model. The column MRF SGD shows the embeddings obtained after 48 hours of training. In comparison, the GPU implementation of our algorithm reached its optimal objective value on the validation dataset in 45 minutes on the Penn Treebank dataset.

Word     MRF Lifted                     MRF SGD                        Word2vec
company  firm, industry, group          he, it, corp.                  holding, anacomp, uniroyal
red      conservative, freedom, black   vietnamese, delegation, judge  cross, tape, delicious
has      had, is, was                   have, had, n’t                 had, been, have
dollar   currency, economy, government  intergroup, market, uses       currency, pound, stabilized
jack     richard, david, carl           like, needed, first            kemp, porter, timothy

Table 1. Nearest neighbours in different embeddings. MRF Lifted are the embeddings learned by our algorithm. MRF SGD are obtained by running stochastic gradient descent for 48 hours on the pseudo-likelihood objective; the algorithm did not converge in that time. Word2vec are the vectors learned by the Word2Vec software of (Mikolov et al., 2013).
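
As an illustration of how neighbour lists such as those in Table 1 can be read off an embedding matrix, the following is a minimal sketch. It assumes cosine similarity as the closeness measure (the text does not state which measure was used); the function name, the toy vocabulary and the toy vectors are hypothetical and are not taken from the trained models.

```python
import numpy as np

def nearest_neighbours(word, vocab, embeddings, k=3):
    """Return the k words whose embeddings are most cosine-similar to `word`."""
    idx = vocab.index(word)
    # Normalise rows so that dot products are cosine similarities.
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = unit @ unit[idx]
    sims[idx] = -np.inf  # exclude the query word itself
    top = np.argsort(-sims)[:k]
    return [vocab[i] for i in top]

# Toy data (hypothetical): two rough clusters in a 2-dimensional space.
vocab = ["dollar", "currency", "pound", "jack", "david"]
embeddings = np.array([
    [1.0, 0.1],
    [0.9, 0.2],
    [0.8, 0.3],
    [0.1, 1.0],
    [0.2, 0.9],
])
neighbours = nearest_neighbours("dollar", vocab, embeddings, k=2)
```

With real embeddings, `embeddings` would be the learned low-rank factor with one row per vocabulary entry.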

Finally, we ran experiments on part-of-speech tagging. For this task we use a different MRF graphical structure: each tag node is connected to its K neighbouring tags as well as to the L nearest words. We use a different set of covering forests, shown in Figure 7. As with language modelling, the partition function for this model would be very inefficient to compute explicitly. However, given a sentence, the best tagging can be found efficiently by dynamic programming.
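
The dynamic program mentioned above can be sketched as a standard Viterbi recursion for the case K = 1. This is an illustrative sketch rather than the implementation used in the experiments: it assumes that, because the words are observed, the tag-word potentials of the L nearest words have already been summed into a per-position score matrix `node_scores` (a hypothetical name), and that `trans_scores` holds the tag-tag log-potentials.

```python
import numpy as np

def viterbi(node_scores, trans_scores):
    """Highest-scoring tag sequence for a first-order chain.

    node_scores:  (n, T) array, log-potential of each tag at each position
                  (assumed to already aggregate the tag-word potentials).
    trans_scores: (T, T) array, log-potential of tag pairs (previous, current).
    """
    n, T = node_scores.shape
    best = np.zeros((n, T))
    back = np.zeros((n, T), dtype=int)
    best[0] = node_scores[0]
    for i in range(1, n):
        # cand[p, c]: best score ending in tag p at i-1, then moving to tag c.
        cand = best[i - 1][:, None] + trans_scores
        back[i] = np.argmax(cand, axis=0)
        best[i] = cand[back[i], np.arange(T)] + node_scores[i]
    # Follow back-pointers from the best final tag.
    tags = [int(np.argmax(best[-1]))]
    for i in range(n - 1, 0, -1):
        tags.append(int(back[i, tags[-1]]))
    return tags[::-1]

node_scores = np.array([[3.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
trans_scores = np.array([[0.0, -5.0], [-5.0, 0.0]])
best_tags = viterbi(node_scores, trans_scores)
```

The cost is O(n T^2) for a sentence of n words and T tags, which is why decoding stays cheap even though the partition function is not.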

For this model, we also employ explicit features for pairwise potentials, i.e. $\theta_{t_i, w_{i+m}} = U^\top f(t_i, w_{i+m})$ and $\theta_{t_i, t_{i+1}} = V^\top g(t_i, t_{i+1})$, where U, V are parameter matrices and f, g are predefined feature functions. For g we use tag-pair indicator features, and for f we use standard features on capitalization, punctuation, and prefixes/suffixes (given in Appendix B). This model and its features are analogous to a standard conditional random field tagger; however, we optimize for joint likelihood.

It is known that joint models are less effective than discriminative conditional models for this task (Liang & Jordan, 2008), but we can compare performance to a similar joint model. We compare this model with K = 1 to a standard first-order HMM tagging model using the TnT tagger (Brants, 2000) with simple rare word smoothing. Table 2 shows the results. The lifted model achieves similar total accuracy, but has much better performance on unseen words, due to its feature structure.

Figure 7. The POS tagging model for K = 2, L = 3, and a decomposition for the lifted inference algorithm.

Model       Total Acc  Unk Acc
HMM         95.8       65.4
Lifted MRF  96.0       76.0

Table 2. Comparison of tagging accuracy between the lifted MRF and an HMM in total and on unseen words.

7.  Conclusion 

This work introduces a Markov random field language model that extends NLMs, and presents a fast lifted inference algorithm with complexity independent of the length of the corpus. We show experimentally that this technique is efficient and estimates useful parameters on two common NLP tasks. The use of low-rank MRFs may also be useful in other applications where random variables have very large state spaces.

Our  paper  presents  a  new  application  area  for  lifted  in¬ 
ference,  and  could  potentially  lead  to  its  broader  adop¬ 
tion  in  machine  learning.  For  example,  one  could  ap¬ 
ply  our  methodology  to  efficiently  learn  the  parameters  of 
grid-structured  MRFs  commonly  used  in  computer  vision, 
where  symmetry  is  obtained  using  an  approximation  which 
wraps  the  grid  around  left-to-right  and  top-to-bottom.  Our 
dual  decomposition  algorithm  may  also  be  more  broadly 
useful  for  efficiently  performing  lifted  variational  infer¬ 
ence. 

Our approach opens the door to putting a much broader class of word embeddings used for language into a probabilistic framework. One of the most exciting directions enabled by our advances is to combine latent variable models together with neural language models. For example, one could imagine using our approach to perform semi-supervised or fully unsupervised learning of part-of-speech tags using vast unlabeled corpora. Our lifted variational inference approach can be easily combined with Expectation Maximization or gradient-based likelihood maximization.

A. Proof of Lemma 1

Let us start by reviewing some of the notions supporting the model presented in the main paper. All probability distributions defined in this paper correspond to Markov random field sequence models, so we begin by describing these models in detail.

A.1. Sequence models

Linear-chain Markov sequence model We start by considering a sequence $x = (x_1, \ldots, x_n)$ of n variables with state space $\mathcal{X}$. We define an order K Markov sequence model as a Markov random field where each element of the sequence is connected to its K left and right neighbours. Figure 8 presents such a model for n = 8, K = 2.


Figure  9.  Cyclic  second  order  Markov  sequence  model  for  n  =  8. 


Figure  8.  Second-order  Markov  sequence  model  for  n  =  8. 

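As mentioned in Section 3, a pairwise MRF with this structure gives the following distribution over $\mathcal{X}^n$:

$$\forall x \in \mathcal{X}^n, \quad \log p^{n}_{\mathrm{seq},\theta}(x) = \sum_{i=1}^{n-K} \sum_{l=1}^{K} \theta^{l}_{x_i, x_{i+l}} - A_{\mathrm{seq}}(\theta) \qquad (20)$$

Where:

$$A_{\mathrm{seq}}(\theta) = \log \sum_{y \in \mathcal{X}^n} \exp\Big( \sum_{i=1}^{n-K} \sum_{l=1}^{K} \theta^{l}_{y_i, y_{i+l}} \Big)$$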


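Cyclic Markov sequence model Now consider the cyclic version of the above sequence model, where the last K tokens are connected to the first K (specifically, edges are added between $v_{n-k}$ and $v_l$, $\forall\, 1 \le k + l \le K$), as illustrated in Figure 9.

This gives the following distribution:

$$\forall x \in \mathcal{X}^n, \quad \log p^{n}_{\mathrm{cycl},\theta}(x) = \sum_{i=1}^{n} \sum_{l=1}^{K} \theta^{l}_{x_i, x_{i+l}} - A_{\mathrm{cycl}}(\theta) \qquad (21)$$

Where:

$$A_{\mathrm{cycl}}(\theta) = \log \sum_{y \in \mathcal{X}^n} \exp\Big( \sum_{i=1}^{n} \sum_{l=1}^{K} \theta^{l}_{y_i, y_{i+l}} \Big)$$

And $\forall l \in [1, K]$, $x_{n+l} = x_l$, $y_{n+l} = y_l$.

A.2. Language modelling

To apply a Markov sequence model to language modelling we also need to explicitly handle the boundary cases of a sentence.

Consider a linear-chain Markov sequence model over a sentence of size M, let $\mathcal{T}$ denote the vocabulary of our corpus, and define the bidirectional context of a word as its K left and right neighbouring tokens. By adding K “padding” or “separator” tokens $\langle S \rangle \notin \mathcal{T}$ to the left and right boundary of the sentence, this notion of context also allows us to bias the distribution of tokens at the beginning and end of the sentence.

In terms of the sequence model defined above, a sentence $t \in \mathcal{T}^M$ will then correspond to a sequence $x(t) \in \mathcal{X}^{M+2K}$ with $\mathcal{X} = \mathcal{T} \cup \{\langle S \rangle\}$, such that $x(t)_{K+1}^{K+M} = t$ and $x(t)_{1}^{K} = x(t)_{M+K+1}^{M+2K} = \langle S \rangle^K$.

This allows us to define the following distribution over sentences of length M:

$$\forall t \in \mathcal{T}^M, \quad p^{M}_{\theta}(t) = p^{M+2K}_{\mathrm{seq},\theta}\big( x_{K+1}^{K+M} = t \,\big|\, x_{1}^{K} = x_{M+K+1}^{M+2K} = \langle S \rangle^K \big) \qquad (22)$$

We can then define the following generative process for sentences (as illustrated in Figure 10):

• Draw the sentence length M from a distribution over integers $\tau$.

• Draw a sequence of M tokens: $t = (t_1, \ldots, t_M) \sim p^{M}_{\theta}$.

Under this model, the likelihood of a corpus $t^c = (t^1, \ldots, t^{n_c})$ is then:

$$p(t^c) = \prod_{i=1}^{n_c} \tau(M_i)\, p^{M_i}_{\theta}(t^i) = \tau(M_1, \ldots, M_{n_c}) \prod_{i=1}^{n_c} p^{M_i}_{\theta}(t^i)$$

Since the maximum likelihood parameters of $\tau$ can easily be estimated, we focus on the second part $\prod_{i=1}^{n_c} p^{M_i}_{\theta}(t^i)$ in the rest of the proof.

A.3. Proving the Lemma

Now we consider the lemma of interest relating the linear-chain Markov sequence model to the cyclic model. We restate the lemma here:

Lemma. Let $S = \{ x(t^c) \mid t^c \in \mathcal{T}^{M_1} \times \ldots \times \mathcal{T}^{M_{n_c}} \}$. Then,

$$\prod_{i=1}^{n_c} p^{M_i}_{\theta}(t^i) = p^{N}_{\mathrm{cycl},\theta}\big( x(t^c) \mid S \big)$$

Our proof first shows how to chain together sentences in a corpus, and then applies the cyclic Markov sequence model.

Concatenating sentences Consider a corpus of $n_c$ sentences $t = (t^1, \ldots, t^{n_c})$ (of lengths $(M_1, \ldots, M_{n_c})$) independently drawn from the above model. As above, we can use a mapping x of t to $\mathcal{X}^{N+K}$, where $N = K + M_1 + \ldots + K + M_{n_c}$, by adding $\langle S \rangle$ tokens at the beginning and end of the corpus and between adjacent sentences:

$$x(t) = \big( \underbrace{\langle S \rangle, \ldots, \langle S \rangle}_{\times K},\; t^1,\; \underbrace{\langle S \rangle, \ldots, \langle S \rangle}_{\times K},\; t^2,\; \ldots,\; t^{n_c},\; \underbrace{\langle S \rangle, \ldots, \langle S \rangle}_{\times K} \big) \qquad (23)$$

Figure 11. Concatenating sentences and

Let us first consider the base case where $n_c = 2$. From Equations 20 and 22, we get that:

$$\forall j \in \{1, 2\}, \quad p^{M_j}_{\theta}(t^j) \propto \exp\Big( \sum_{i=1}^{M_j+K} \sum_{l=1}^{K} \theta^{l}_{x(t^j)_i,\, x(t^j)_{i+l}} \Big)$$

Additionally, we have by construction:

$$\forall i \in [1, K], \quad x(t^1)_{M_1+K+i} = x(t^2)_i = \langle S \rangle$$

Hence:

$$\sum_{i=1}^{M_1+M_2+2K} \sum_{l=1}^{K} \theta^{l}_{x(t^1,t^2)_i,\, x(t^1,t^2)_{i+l}} = \sum_{i=1}^{M_1+K} \sum_{l=1}^{K} \theta^{l}_{x(t^1)_i,\, x(t^1)_{i+l}} + \sum_{j=1}^{M_2+K} \sum_{l=1}^{K} \theta^{l}_{x(t^2)_j,\, x(t^2)_{j+l}}$$

In other words:

$$p^{M_1}_{\theta}(t^1)\, p^{M_2}_{\theta}(t^2) \propto \exp\Big( \sum_{i=1}^{M_1+K} \sum_{l=1}^{K} \theta^{l}_{x(t^1)_i,\, x(t^1)_{i+l}} \Big) \times \exp\Big( \sum_{i=1}^{M_2+K} \sum_{l=1}^{K} \theta^{l}_{x(t^2)_i,\, x(t^2)_{i+l}} \Big)$$

$$\propto \exp\Big( \sum_{i=1}^{M_1+M_2+2K} \sum_{l=1}^{K} \theta^{l}_{x(t^1,t^2)_i,\, x(t^1,t^2)_{i+l}} \Big) \propto p^{N+K}_{\mathrm{seq},\theta}\big( x = x(t^1, t^2) \big)$$

By induction, we get that:

$$\prod_{i=1}^{n_c} p^{M_i}_{\theta}(t^i) \propto p^{N+K}_{\mathrm{seq},\theta}\big( x = x(t) \big)$$

Now, let $S_N = \{ x(t) \mid t \in \mathcal{T}^{M_1} \times \ldots \times \mathcal{T}^{M_{n_c}} \}$. Since the text model is defined for $x \in S_N$, by normalization, it then follows that:

$$\prod_{i=1}^{n_c} p^{M_i}_{\theta}(t^i) = \frac{p^{N+K}_{\mathrm{seq},\theta}\big( x = x(t) \big)}{p^{N+K}_{\mathrm{seq},\theta}( x \in S_N )} = p^{N+K}_{\mathrm{seq},\theta}\big( x = x(t) \mid x \in S_N \big) \qquad (24)$$

Using a cyclic model Finally, $\forall x \in S_N$, we have that $\forall l \in [1, K]$, $x_l = x_{N+l} = \langle S \rangle$. According to Equations 20 and 21, this means that:

$$\forall x \in S_N, \quad p^{N+K}_{\mathrm{seq},\theta}(x) \propto p^{N}_{\mathrm{cycl},\theta}(x)$$

Hence:

$$\prod_{i=1}^{n_c} p^{M_i}_{\theta}(t^i) \propto p^{N}_{\mathrm{cycl},\theta}\big( x = x(t) \big)$$

Which by normalization gives us:

$$\prod_{i=1}^{n_c} p^{M_i}_{\theta}(t^i) = \frac{p^{N}_{\mathrm{cycl},\theta}\big( x = x(t) \big)}{p^{N}_{\mathrm{cycl},\theta}( S_N )} = p^{N}_{\mathrm{cycl},\theta}\big( x = x(t) \mid x \in S_N \big) \qquad (25)$$

Which proves the lemma.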
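
The identity in Equation 25 can be checked numerically by brute force on a tiny example. The sketch below is only a sanity check under stated assumptions, not part of the proof: the vocabulary size `V`, the random potentials `theta`, the toy corpus and all function names are hypothetical. Because both sides are conditioned on the separator positions, the log-partition terms cancel and only sums over word assignments are needed.

```python
import itertools
import math
import random

S = 0  # index of the separator token <S>; words are 1..V

def score(theta, x, K, cyclic):
    """Sum of theta[l-1][x_i][x_{i+l}] over i and 1 <= l <= K.

    Linear chain (Eq. 20): i covers the first len(x) - K positions.
    Cyclic (Eq. 21): i covers all positions and indices wrap around.
    """
    n = len(x)
    n_terms = n if cyclic else n - K
    total = 0.0
    for i in range(n_terms):
        for l in range(1, K + 1):
            total += theta[l - 1][x[i]][x[(i + l) % n]]
    return total

def pad(t, K):
    """x(t) for one sentence: K separators on each side."""
    return [S] * K + list(t) + [S] * K

def concat(sentences, K):
    """<S>^K t^1 <S>^K t^2 ... of length N; the wrap-around supplies the final pad."""
    x = []
    for t in sentences:
        x += [S] * K + list(t)
    return x

def sentence_prob(theta, t, V, K):
    """p^M_theta(t): linear-chain model conditioned on the padding (Eq. 22)."""
    words = range(1, V + 1)
    Z = sum(math.exp(score(theta, pad(u, K), K, False))
            for u in itertools.product(words, repeat=len(t)))
    return math.exp(score(theta, pad(t, K), K, False)) / Z

def cyclic_conditional(theta, sentences, V, K):
    """p_cycl(x = x(t) | x in S_N), normalised over corpora with the same lengths."""
    words = range(1, V + 1)
    lengths = [len(t) for t in sentences]
    Z = 0.0
    for flat in itertools.product(words, repeat=sum(lengths)):
        corpus, pos = [], 0
        for m in lengths:
            corpus.append(flat[pos:pos + m])
            pos += m
        Z += math.exp(score(theta, concat(corpus, K), K, True))
    return math.exp(score(theta, concat(sentences, K), K, True)) / Z

random.seed(0)
V, K = 2, 2
theta = [[[random.gauss(0, 1) for _ in range(V + 1)] for _ in range(V + 1)]
         for _ in range(K)]
corpus = [(1, 2), (2,)]
lhs = math.prod(sentence_prob(theta, t, V, K) for t in corpus)
rhs = cyclic_conditional(theta, corpus, V, K)
```

Here `lhs` is the product of per-sentence probabilities and `rhs` is the conditional probability under the cyclic model; the two should agree up to floating-point error.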


Language modelling experiments In our implementation of the inner loop of the algorithm (Algorithm 1 of the main paper), we use L-BFGS to find the optimal value of $\delta$. However, as mentioned in Section 4, the inner loop does not need to be run to optimality to find an ascent direction.

The LBL model in Figure 6 was trained using SGD on minibatches of size 64. The learning rate was initialized at 0.1, and halved any time the validation error went up.
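
The halving rule for the LBL learning rate can be written as a small helper. This is a sketch under one assumption that the text leaves open: "went up" is taken to mean an increase relative to the previous validation check (rather than relative to the best value so far). The function name and the example error values are hypothetical.

```python
def halve_on_increase(val_errors, lr0=0.1):
    """Return the learning rate in effect after each validation check.

    The rate starts at lr0 and is halved whenever the validation error
    is higher than at the previous check (assumed interpretation).
    """
    lr, rates, prev = lr0, [], None
    for err in val_errors:
        if prev is not None and err > prev:
            lr /= 2.0
        rates.append(lr)
        prev = err
    return rates

# Hypothetical validation errors over five checks.
rates = halve_on_increase([5.0, 4.0, 4.5, 4.2, 4.3])
```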

The  second  model  presented  in  Table  1  (MRF  SGD)  was 
trained  by  running  SGD  on  the  model  pseudo-likelihood, 
with  minibatches  of  size  100.  The  learning  rate  was  initial¬ 
ized  at  0.025  and  decayed  as  . 

Sequence tagger features For part-of-speech tagging experiments we make use of two feature functions $g(t_i, t_{i+1})$ and $f(t_i, w_{i+m})$. The tag-tag function g simply consists of indicator features for all possible pairs of tags. The feature function $f(t_i, w_{i+m})$ conjoins an indicator of the tag $t_i$ with surface-form features including:

• An indicator for the word $w_{i+m}$ itself.

• Prefixes and suffixes of $w_{i+m}$ up to length 4.

When m = 0, i.e. for the potential between a word and the tag directly above it, the tag is further conjoined with a standard set of morphology features including:

• Is $w_i$ completely upper case?

• Is the first letter of $w_i$ upper case?

• Does $w_i$ end with ‘s’?

• Is the first letter of $w_i$ upper case, and does it end with ‘s’?

• Is $w_i$ completely upper case, and does it end with ‘S’?

• Does $w_i$ contain a digit?

• Is $w_i$ all digits?

• Does $w_i$ contain a hyphen?
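
The feature templates listed above can be sketched as follows. The string encoding of features, the inclusion of the offset m in the feature name, and all function names are assumptions made for illustration; only the list of templates comes from the text.

```python
def surface_features(word, max_affix=4):
    """Word identity plus prefixes and suffixes of up to max_affix characters."""
    feats = ["word=" + word]
    for k in range(1, min(max_affix, len(word)) + 1):
        feats.append("prefix=" + word[:k])
        feats.append("suffix=" + word[-k:])
    return feats

def morphology_features(word):
    """Boolean morphology tests, used only for the word directly under the tag."""
    upper = word.isupper()
    cap = word[:1].isupper()
    return {
        "all_upper": upper,
        "init_upper": cap,
        "ends_s": word.endswith("s"),
        "init_upper_ends_s": cap and word.endswith("s"),
        "all_upper_ends_S": upper and word.endswith("S"),
        "has_digit": any(c.isdigit() for c in word),
        "all_digits": word.isdigit(),
        "has_hyphen": "-" in word,
    }

def f(tag, word, m):
    """Conjoin the tag with every active feature of the word at offset m."""
    feats = surface_features(word)
    if m == 0:
        feats += [name for name, on in morphology_features(word).items() if on]
    return [f"{tag}|m={m}|{feat}" for feat in feats]

example = f("NNP", "Jones", 0)
```

Each returned string names one active indicator; a sparse feature vector would map these names to indices.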


B.  Implementation  Details 

Synthetic data generation The synthetic data used to obtain the results presented in Figure 5 consists of a sequence of 12 tokens sampled uniformly at random from $\mathcal{T} = \{a, b, c, d\}$. For K = 2 this gives a sequence of the form:

⟨S⟩ ⟨S⟩ a b c d b a b d c b a c ⟨S⟩ ⟨S⟩
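
A sequence of this form can be generated with a few lines of code. This is a minimal sketch: the function name, the `"<S>"` string used for the separator, and the seed are hypothetical choices, and the sampled tokens will differ from the example shown above.

```python
import random

def synthetic_sequence(n=12, vocab="abcd", K=2, sep="<S>", seed=0):
    """Sample n tokens uniformly from vocab and pad with K separators per side."""
    rng = random.Random(seed)
    tokens = [rng.choice(vocab) for _ in range(n)]
    return [sep] * K + tokens + [sep] * K

seq = synthetic_sequence()
```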


